Summer 2025, Pre-Assignment

Part 3: Introduction to Data Visualization

  1. Preparation: Please complete these tutorials before starting this notebook. We strongly recommend taking notes below and typing in the code yourself as you follow them.
  2. Use the following FAQs if you get stuck. Any other questions? Post them on Slack.
# INSTALLATION CODE:

# This code block installs a few extra packages you will need
# This may take a few minutes, the icon on the left will spin.
# When it stops spinning it is complete.
# When we meet in person, we will learn other ways to load your
# packages.

pkgs <- c("tidyverse", "ggrepel", "gapminder", "maps", "ggthemes")

to_install <- which(!(pkgs %in% rownames(installed.packages())))

install.packages(pkgs[to_install])

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(gapminder)
library(maps)
## 
## Attaching package: 'maps'
## 
## The following object is masked from 'package:purrr':
## 
##     map
# Datasets like iris, mpg, gapminder, etc. are all available for you to use here.

# Here, we recreate the Asia dataset in the tutorial for you.
# Filter down to relevant countries
asia <- gapminder |> 
  filter(
    country %in% c("China", "Japan", "Korea, Rep.", "Korea, Dem. Rep."))

# Give the two Koreas simpler names; China and Japan keep theirs
asia <- asia |> 
  mutate(
    country = case_when(
      country == "Korea, Rep." ~ "South Korea",
      country == "Korea, Dem. Rep." ~ "North Korea",
      country == "China" ~ "China",
      country == "Japan" ~ "Japan")
  )

cat("Done!")
## Done!
# If needed: 
# TAKE NOTES HERE for the Primers

As usual, once you have completed the RStudio tutorials above, please start here and continue the document below. These exercises give you additional opportunities to practice the most important concepts from the RStudio Tutorials for our HKS courses.

1. Building a plot

Looking at a dataset is nice, but we will often want to visualize our data. R has incredibly powerful tools for data visualization.

To start, load the tidyverse library and read this dataset on US presidential election results from 1932 to 2016 by state.

# Run this code: it loads libraries and does some setup
library(tidyverse)
options(repr.plot.width=10, repr.plot.height=10)
# Only the last theme_set() call takes effect, so one call is enough
theme_set(theme_linedraw(base_size = 30))
update_geom_defaults("point",list(size=5))
update_geom_defaults("line",list(lwd=1.5))

elections <- read_csv("https://www.dropbox.com/s/lhp9nets5qb2rhe/presidential_elections.csv?dl=1")
## Rows: 1097 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): state, abb, region
## dbl (2): democrat, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# show first 10 rows
head(elections, n = 10)
## # A tibble: 10 × 5
##    state       abb   democrat  year region   
##    <chr>       <chr>    <dbl> <dbl> <chr>    
##  1 Alabama     AL        84.8  1932 South    
##  2 Arizona     AZ        67.0  1932 West     
##  3 Arkansas    AR        86.3  1932 South    
##  4 California  CA        58.4  1932 West     
##  5 Colorado    CO        54.8  1932 West     
##  6 Connecticut CT        47.4  1932 Northeast
##  7 Delaware    DE        48.1  1932 South    
##  8 Florida     FL        74.5  1932 South    
##  9 Georgia     GA        91.6  1932 South    
## 10 Idaho       ID        58.7  1932 West

Let’s look at the results for Massachusetts:

# Don't forget to run the code above first to read in the data
ma <- elections |> 
  filter(state == "Massachusetts")

head(ma)
## # A tibble: 6 × 5
##   state         abb   democrat  year region   
##   <chr>         <chr>    <dbl> <dbl> <chr>    
## 1 Massachusetts MA        50.6  1932 Northeast
## 2 Massachusetts MA        51.2  1936 Northeast
## 3 Massachusetts MA        53.1  1940 Northeast
## 4 Massachusetts MA        52.8  1944 Northeast
## 5 Massachusetts MA        54.7  1948 Northeast
## 6 Massachusetts MA        45.5  1952 Northeast
# Practice by creating an object for a different state or time period.

As you might expect, there are many functions for plotting! As you learned in the RStudio tutorials, the starting point for every plot we make in this course is called ggplot().

Starting with a dataset, you can create a plot with year on the x-axis and democrat on the y-axis with:

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  )

ggplot() makes a plot for you, and the aes() function (short for “aesthetic”) describes which variables in the dataset go on the x and y axes (for now; we will use aes() for other things later).

But it’s empty! To get shapes to appear on the plot, we need to ask for a particular geom (short for “geometry”). A geom in R is a way to visualize the data, like a point, a line, or a shape. To further customize this plot, we simply add a geom for the shape we want. Let’s use geom_line() to make a line:

Hint: if the plot below looks too small on your computer, you can click the “show in new window” icon at the top right corner of the plot.

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_line()

# Try creating a line plot for a different state

Notice the + sign! We add a + sign between different pieces of a plot.

We could reuse almost exactly the same code with a different geom to draw points instead:

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_point()

You can also add both! Notice how the points appear on top of the line, since we added them after:

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_line() +
  geom_point()
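
Layer order is easy to check on a tiny example. The sketch below uses a made-up data frame (toy is hypothetical and not part of the elections data) and builds the same plot with the two geoms added in opposite orders. ggplot2 draws layers in the order they are added, so the second plot should show the line running over the points.

```r
library(ggplot2)

# Made-up values for illustration only (not from the elections data)
toy <- data.frame(
  year     = c(2000, 2004, 2008, 2012),
  democrat = c(48, 45, 55, 61)
)

# Line first, points second: points are drawn on top of the line
points_on_top <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_line() +
  geom_point()

# Points first, line second: the line is drawn over the points
line_on_top <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_point() +
  geom_line()

points_on_top
line_on_top
```

The difference is subtle when both geoms are black; giving the line a different color makes it easier to see which layer is on top.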

Exercises #1

  1. Okay, let’s all try this. Create a new object with the election results from one state other than Massachusetts. Use it to make a line plot like the one above.
# Write your code here 
  2. Then, try to make a bar graph using geom_col() instead of points or lines.
# Write your code here
  3. Look back at the line, point, and bar plots you made. Are they all displaying the same information? Which one do you think is most effective?

Answer here in text!

2. Aesthetics

We added an x and y aesthetic to choose particular columns to display on our axes, but plots can accept many other arguments.

Colors

As you saw in the RStudio tutorials, if you want to make your geoms a certain color, that is very easy to do with the color argument:

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_line(color = "grey") +
  geom_point(color = "blue")

This looks great, but what if we want the colors in our plots to depend on the value of the data? For example, red points for elections that Republicans won and blue for elections that Democrats won.

Then, people looking at our plot would see additional pieces of information beyond the values on the x and y axes.

Just like the x and y axes, if we want the color of the points to depend on values in the data we have to use a column in our dataset to define the colors. Let’s make a new column that shows whether the Democratic candidate won the election.

For a crude measure of the election winner, let’s use whether democrat is greater than 50 percent (this is too simple since more than two candidates can run, but it’s okay for now).

# Create a new column for a Democratic winner
ma <- ma |>
  mutate(
    winner = democrat > 50
  )

head(ma)
## # A tibble: 6 × 6
##   state         abb   democrat  year region    winner
##   <chr>         <chr>    <dbl> <dbl> <chr>     <lgl> 
## 1 Massachusetts MA        50.6  1932 Northeast TRUE  
## 2 Massachusetts MA        51.2  1936 Northeast TRUE  
## 3 Massachusetts MA        53.1  1940 Northeast TRUE  
## 4 Massachusetts MA        52.8  1944 Northeast TRUE  
## 5 Massachusetts MA        54.7  1948 Northeast TRUE  
## 6 Massachusetts MA        45.5  1952 Northeast FALSE

Remember how this code works: the column democrat in ma is really a vector. The code works very similarly to running something like:

democrat <- c(52, 37, 63)
democrat > 50
## [1]  TRUE FALSE  TRUE
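
To see why the 50 percent rule is only a crude measure, here is a small sketch with made-up numbers. The three_way data frame and its republican and other columns are hypothetical (the elections data only has democrat). It compares the simple majority rule with a plurality rule, where the candidate with the largest share wins.

```r
library(dplyr)

# Made-up three-candidate results for illustration only;
# republican and other are hypothetical columns, not in the elections data
three_way <- tibble(
  year       = c(1992, 1996, 2000),
  democrat   = c(47, 62, 60),
  republican = c(30, 28, 33),
  other      = c(23, 10, 7)
)

three_way <- three_way |>
  mutate(
    majority  = democrat > 50,                      # the simple rule used above
    plurality = democrat > pmax(republican, other)  # largest share wins
  )

three_way
```

In the first row the two rules disagree: the Democratic share is below 50, yet it is still the largest of the three.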

If you want the color of the points to depend on the value of a column, then you can use the color argument in the aes() function. R will assign one color to each value in the winner vector. Since there are only TRUE and FALSE values in this column, all of the TRUE values will have one color and FALSE will have another.

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat, 
      color = winner
    )
  ) +
  geom_point()
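
By default, ggplot2 chooses the two colors for you. If you want the red and blue described earlier, one option is scale_color_manual(), which goes a little beyond the tutorials. The sketch below uses made-up data (toy and party_colors are hypothetical names) and assumes blue for TRUE and red for FALSE.

```r
library(ggplot2)

# Made-up values for illustration only (not from the elections data)
toy <- data.frame(
  year     = c(2000, 2004, 2008, 2012),
  democrat = c(48, 45, 55, 61)
)
toy$winner <- toy$democrat > 50

# Named vector: each value of winner gets its own color
party_colors <- c("TRUE" = "blue", "FALSE" = "red")

p_colors <- ggplot(toy, aes(x = year, y = democrat, color = winner)) +
  geom_point() +
  scale_color_manual(values = party_colors)

p_colors
```

Naming the colors by value, rather than relying on their order, keeps the assignment correct even if one of the two values is missing from the data.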

What if we add the line back?

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat, 
      color = winner)
  ) +
  geom_point() +
  geom_line()

Uh-oh! What’s happening here? Well, we’ve asked the plot to change the color of our shapes according to the winner variable. Since we have both points and a line, the plot is trying to change the color of both.

What if we only want to change the color of the points depending on the value of winner? Well, we can include that aesthetic only in the geom_point() function.

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_line() +
  geom_point(aes(color = winner))

Like before, you can still set the color of the line manually since you don’t want the color to vary by the value of a column. Make sure to do this outside of aes():

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_line(color = "grey") +
  geom_point(aes(color = winner))
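
A common slip is to put a fixed color inside aes(). As a rough sketch on made-up data (toy is hypothetical), compare setting a color outside aes() with accidentally mapping it inside aes():

```r
library(ggplot2)

# Made-up values for illustration only (not from the elections data)
toy <- data.frame(
  year     = c(2000, 2004, 2008),
  democrat = c(48, 45, 55)
)

# Setting: color is a fixed value, given outside aes()
p_set <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_point(color = "blue")

# Mapping by mistake: inside aes(), "blue" is treated as a data value,
# so ggplot2 picks a color from its default palette and adds a legend
p_mapped <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_point(aes(color = "blue"))

p_set
p_mapped
```

The second plot is usually not blue at all, which is a good hint that a constant has ended up inside aes().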

Size and shape

Similarly, you can have the size of a point depend on the value of a column. For example, in the plot below, points where winner is TRUE are drawn larger than points where it is FALSE:

ma |>
  ggplot(aes(x = year, y = democrat)) +
  geom_line(color = "grey") +
  geom_point(aes(size = winner))
## Warning: Using size for a discrete variable is not advised.

Now, points are larger when winner is TRUE! However, those points already sit higher on the y-axis (above 50), so size does not add much information to our plot. R also warns us that size is not a good choice for a variable with only a few distinct values.

The same is true for shape:

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_line(color = "grey") +
  geom_point(aes(shape = winner))

Exercises #2

  1. Create a new column in the ma dataset called percent. The values should equal the values in democrat divided by 100.
# Write your code here.
  2. Make a scatterplot for the ma object with year on the x-axis and percent on the y-axis.
# Write your code here.
  3. Create a new column in ma called modern which is TRUE for all elections after 1980 and FALSE for those before. Create a plot with year on the x-axis, democrat on the y-axis, color the points by winner, and vary the shape by modern.
# Write your code here.

3. Customizing your visualizations

Geometries and aesthetics are the core of a nice visualization. R gives you many many more tools to customize your plots any way you want. For example:

Labels

Labels are important in any plot. We create these with the labs() function, which has arguments for the title, subtitle, caption, and x and y labels. You can choose which labels to include in your plot. For example:

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_line(color = "grey") +
  geom_point(aes(color = winner)) +
  labs(
    title = "Massachusetts Presidential \n Election Results",
    subtitle = "1932-2016",
    x = "Election Year",
    y = "Democratic %"
  )

You can also set your own axis limits in R: the minimum and maximum values on the x (horizontal) axis and y (vertical) axis. R will often try to pick them for you automatically, but sometimes you may want to choose your own.

The xlim() and ylim() functions will take a vector (specified by c()) with the smallest and largest values you want for that axis.

For example, R automatically chose a y-axis for the previous plot that stretched from around 40 to 70 because that’s where our values were. However, what if we wanted to make that go from 0 to 100? We could change ylim() like this:

ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_line(color = "grey") +
  geom_point(aes(color = winner)) +
  labs(title = "Massachusetts Presidential \n Election Results",
       subtitle = "1932-2016",
       x = "Election Year",
       y = "Democratic %") +
  ylim(c(0, 100)) # now 0 is the minimum, 100 is the maximum
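
xlim() works the same way as ylim(). One thing to keep in mind is that both functions drop any observations that fall outside the limits (R prints a warning when it draws the plot). If you only want to zoom in without dropping data, coord_cartesian() is an alternative. Here is a minimal sketch on made-up data (toy, p_lim, and p_zoom are hypothetical names):

```r
library(ggplot2)

# Made-up values for illustration only (not from the elections data)
toy <- data.frame(
  year     = c(1940, 1960, 1980, 2000, 2016),
  democrat = c(53, 60, 42, 60, 60)
)

base <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_point()

# xlim() sets the scale limits: the 1940 point is dropped, with a warning
p_lim <- base + xlim(c(1960, 2016))

# coord_cartesian() only zooms the view: all points stay in the data
p_zoom <- base + coord_cartesian(xlim = c(1960, 2016))

p_lim
p_zoom
```

For points the two plots look the same, but the difference matters for lines and summaries, which are computed only from the data that remains.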

Themes

Themes are simple ways to improve the presentation of your plot as well. We will learn how to make our own later, but for now you can use built-in themes. Some built-in themes include theme_bw(), theme_minimal(), and theme_dark().

For convenience, you can also store plots to an object and add additional features onto that object:

# save plot in an object called p
p <- ma |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_line(color = "grey") +
  geom_point(aes(color = winner)) +
  labs(
    title = "Massachusetts Presidential Election Results",
    subtitle = "1932-2016",
    x = "Election Year",
    y = "Democratic %"
  )

# now we can make more customizations to p
# without retyping everything
p + theme_minimal()

p + theme_dark()

There are many, many more themes available via packages like ggthemes.

library(ggthemes)

This opens up many many more themes for you, many of which are listed at this link. Here are a few:

p + theme_clean()

p + theme_fivethirtyeight() # 538

p + theme_igray()           # Gray background

p + theme_economist()       # The Economist

p + theme_stata()           # theme from a language called Stata

p + theme_solarized()

You can edit almost anything you want about a plot’s theme, even if you’ve already added a preset theme. Most of this work happens through the theme() function. You can run ?theme to get a full list of options. For example:

p +
  theme_bw() +
  theme(legend.position = "bottom")

Facets

Often, you will want to plot several groups at once. However, putting all information on one plot can be overwhelming. For example, consider this plot:

northeast <- elections |> 
  filter(region == "Northeast")

northeast |>
  mutate(winner = democrat > 50) |>
  ggplot(
    aes(
      x = year, 
      y = democrat, 
      color = winner
    )
  ) +
  geom_point()

Why is this so cluttered? Well, we are now plotting results from all states in the Northeast! We could color by state instead, but that might look overwhelming:

northeast |>
  ggplot(
    aes(
      x = year, 
      y = democrat,
      color = state
    )
  ) +
  geom_point()

Wow! That looks terrible. Instead, what if we plotted a separate line for each state?

northeast |>
  ggplot(
    aes(
      x = year, 
      y = democrat,
      color = state
    )
  ) +
  geom_point() +
  geom_line()

That looks a little better, but it is still difficult to tell the lines apart. What if we made a smaller plot for each state and combined them? This is what a facet is. If we ask for a facet_wrap() by state, R will make one plot per state:

northeast |>
  ggplot(aes(x = year, y = democrat)) +
  geom_point() +
  geom_line() +
  facet_wrap(~state) + # notice the ~ key (called a tilde)
  theme_linedraw()

We could also add the winner color back and facet_wrap() will automatically apply it to each plot:

northeast |>
  mutate(winner = democrat > 50) |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_line() +
  geom_point(aes(color = winner)) +
  facet_wrap(~state) + # notice the ~ key (called a tilde)
  theme_linedraw() +
  labs(
    x = "Election Year",
    y = "Democratic %",
    title = "Presidential Elections",
    subtitle = "1932-2016, Northeastern States"
  )
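
facet_wrap() also lets you control how the panels are arranged. As a small sketch on made-up data (toy and its three states are hypothetical), the ncol argument stacks one panel per state in a single column:

```r
library(ggplot2)

# Made-up values for three hypothetical states, for illustration only
toy <- data.frame(
  state    = rep(c("State A", "State B", "State C"), each = 3),
  year     = rep(c(2008, 2012, 2016), times = 3),
  democrat = c(52, 55, 49, 61, 60, 58, 44, 47, 51)
)

p_facets <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_line() +
  geom_point() +
  facet_wrap(~state, ncol = 1)  # one panel per state, in a single column

p_facets
```

There is one panel for each distinct value of state, so three states give three panels.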

Does the font size look too small to you? There are many many ways of customizing ggplot() objects, many of which we will learn throughout the course. Here is a helpful cheatsheet with many of the options listed in case you would like to delve deeper into this.

# For example, theme() arguments like axis.text and axis.title change font sizes
# for particular places on your plot.
northeast |>
  ggplot(
    aes(
      x = year, 
      y = democrat
    )
  ) +
  geom_point() +
  geom_line() +
  facet_wrap(~state) + # notice the ~ key (called a tilde)
  theme_linedraw() +
  theme(
    strip.text = element_text(size = 25),
    axis.text = element_text(size = 15),
    axis.title = element_text(size = 15)
  )

Exercises #3

  1. The pop dataset below contains state population data over time. For any state you want, make a plot showing population by year for every year after 1960.
pop <- read_csv("https://www.dropbox.com/s/javbnd4c3n67380/state_population.csv?dl=1")
## Rows: 6020 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): state, region
## dbl (2): year, population
## lgl (1): after2000
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  2. Add labels and a theme to your plot from Question 1.
# Write your code here.
  3. Now, design a plot (or extend your plot from Question 2) that uses a facet in some way (a facet by state or region could be interesting, but feel free to be creative!).
# Write your code here.

Reminder to Submit

Please follow the submission instructions listed here. We suggest you submit your assignments as you finish them (i.e., don’t wait until you have completed them all to submit).